For this report, I used the epuRate’s template of Yan Holtz who made it publicly available.
Welcome to the second project about Networks. This time, we’re going to head into the world of cinema, and not just any cinema… that of the 60s and 70s!
Before diving into the project, let’s present a bit the data. The data set concerns the collaboration of composers and film producers in Hollywood from 1960 to 1970. When a producer wants to produce a film, he often asks a composer for help with the soundtracks. Indeed, music is something that is very important for a film. Who doesn’t know the music of the Lord of the Rings or Star Wars?
So here we have Composers that produce soundtracks for Producers. The size of the edges (the link between the names) represents the number of collaborations they had together.
The first thing to do is obviously to prepare the data. Here, the file extension does not allow to import the data in a clean way, like with a csv file for example.
Everything was imported as a string, and I had to do a lot of manipulation to get tidy data and to be able to represent it in a graph.
If you’re interested to know how I did it, you can click on the “code” button on the bottom right.
Ok, it’s time for the first visualization of the data.
# Import the data as a text string
movies <- readtext("data/Movies.paj")
# Split the string into a vector
movies_text <- movies %>%
mutate(text = str_split(text, "\n", simplify = TRUE)) %>%
pull(text) %>%
t() %>%
as_tibble()
# Extracting the nodes values
nodes_tibble <- movies_text %>%
filter(TRUE == str_detect(V1, pattern = "\"(.*)\"")) %>% # select only the nodes rows
mutate(V1 = str_extract(V1, pattern = "[:digit:]+[:space:]\"(.*)\""), # select only the data of interest (node_id + name)
V1 = str_replace_all(V1, pattern = "\"", replacement = "")) %>% # remove the hyphens
separate(V1, c("node_id","name"),extra = "merge") %>% # separate into two columns
mutate(node_id = as.numeric(node_id))
# Extracting the edges values
edges_tibble <- movies_text %>%
filter(FALSE == str_detect(V1, pattern = "[:alpha:]")) %>% # select only edges rows
mutate(V1 = str_trim(V1, side = "both")) %>% # removing extra space
separate(V1, c("to", "from", "n_collabs")) %>% # separating into 3 columns
na.omit() %>% # remove NAs
mutate_all(.funs = as.numeric) # change to numeric variables
# transform data into one suitable "tbl_object"
tidy_tree <- tbl_graph(
nodes = nodes_tibble,
edges = edges_tibble,
directed = TRUE
)
# filter out the unconnected nodes
tidy_tree <- tidy_tree %>%
filter(node_id != which(node_is_isolated()))# Creating a variable composer vs producer
tidy_tree <- tidy_tree %>%
activate(nodes) %>%
mutate(Profession = case_when(node_id %in% edges_tibble$from ~ "Composer",
node_id %in% edges_tibble$to ~ "Producer"))
# Plot the first network
ggraph(tidy_tree) +
geom_edge_link2(
aes(edge_width = n_collabs),
lineend = 'round',
arrow = arrow(length = unit(3, "mm")),
show.legend = FALSE,
end_cap = rectangle(width = 2.2, height = 1, width_unit = "cm", height_unit = "cm"),
start_cap = rectangle(width = 2.2, height = 1, width_unit = "cm", height_unit = "cm"),
color = "white",
alpha = 0.3
) +
geom_node_label(aes(label = name, fill = Profession),
size = 3,
colour = 'white',
alpha = 0.8,
repel = TRUE
) +
scale_fill_manual(values = c('Composer' = "#727B7C", 'Producer' = "#69292B")) +
labs(title = "Hollywood Network",
subtitle = "Collaboration between composer and producer to produce films",
caption = "Source: Ucinet Software") +
theme_graph() +
theme(
plot.background = element_rect(fill = "#212121"),
plot.title = element_text(color = "white", size = 40),
plot.subtitle = element_text(color = "white", size = 20),
legend.position = "bottom",
legend.direction = "vertical",
legend.text = element_text(color = "white", size = 14),
legend.title = element_text(color = "white", size = 18),
plot.caption = element_text(color = "white", size = 12)
) +
coord_flip()As mentioned above, we can see the number of collaborations between producer and composer by the size of the arrow. As there are many people, it is difficult to represent this Network properly. Let’s try to focus on the producers. In particular, let’s try to bring together producers who have worked with the same composers.
# Create the edges for the connection between producer
edges_producer <- edges_tibble %>%
group_by(from) %>% # group by composer
summarise(from = list(to), to = list(to)) %>% #create two lists variables from producer that have worked with the same composer
unnest(from) %>% # transforming the lists into one row per element in the "from" column
unnest(to) # the same transformation for the "to" column
# Create a tbl_graph with the new data
producer_tree <- tbl_graph(
nodes = nodes_tibble,
edges = edges_producer,
directed = FALSE) %>%
activate(edges) %>%
convert(to_simple) %>% # eliminate doubles and loops
activate(nodes) %>%
filter(TRUE != node_is_isolated()) # filter the nodes to get only producer
# Plot the producer graph
producer_tree %>%
ggraph(layout = "igraph", algorithm = "kk") +
geom_edge_fan(
lineend = 'round',
show.legend = FALSE,
edge_width = 0.5,
color = "white",
alpha = 0.3
) +
geom_node_label(aes(label = name),
fill = "#69292B",
size = 4,
colour = 'white',
alpha = 0.8,
repel = FALSE
) +
labs(title = "Hollywood Network of Producer",
subtitle = "Producer that worked with the same composer",
caption = "Source: Ucinet Software") +
theme_graph() +
theme(
plot.background = element_rect(fill = "#212121"),
plot.title = element_text(color = "white", size = 40),
plot.subtitle = element_text(color = "white", size = 20),
plot.caption = element_text(color = "white", size = 12),
legend.position = "none",
) The producers are connected if they have collaborated with the same composer. Are there clusters in the network? Are there groups of producers who all tend to work with the same composer? Let’s find out with a clustering algorithm.
# Add a cluster group
producer_cluster <- producer_tree %>%
activate(nodes) %>%
mutate(group = as.factor(group_fast_greedy()))
my_graph <- producer_cluster %>%
ggraph(layout = "backbone") +
geom_edge_fan(
lineend = 'round',
show.legend = FALSE,
edge_width = 0.5,
color = "white",
alpha = 0.3
) +
geom_node_label(aes(label = name, fill = group),
size = 4,
colour = 'white',
alpha = 0.8,
repel = TRUE
) +
rcartocolor::scale_colour_carto_d(palette = "Bold", guide = FALSE) +
labs(title = "Hollywood Network Clusters of Producer",
subtitle = "Producer that worked with the same composer",
caption = "Source: Ucinet Software") +
theme_graph() +
theme(
plot.background = element_rect(fill = "#212121"),
plot.title = element_text(color = "white", size = 40),
plot.subtitle = element_text(color = "white", size = 20),
plot.caption = element_text(color = "white", size = 12),
legend.position = "none",
)
my_graphAlthough the clusters are not perfect (Wallis in blue is in the red group for example), the algorithm makes a fairly clear distinction between the groups. It seems quite obvious that producers divide themselves into several groups, depending on their affinity with the same composers. It would be interesting to analyse the similarities between these composers. For example, do they produce the same kind of music? (classical, jazz, pop etc.) Unfortunately, such an analysis is not the aim of this project.
In this part, I simply add a shaded area to make the graphic more attractive. This is of course only a matter of personal taste.
# Extract the coordinates to keep the same layout
my_layout <- my_graph %>%
purrr::pluck('data') %>%
magrittr::extract(c('x', 'y'))
# Add coordinate to our data
producer_my_layout <- producer_cluster %>%
activate(nodes) %>%
mutate(x = my_layout$x,
y = my_layout$y)
# Producing with my layout and shaded areas
producer_my_layout %>%
ggraph(layout = 'manual', x = x, y = y) +
ggforce::geom_mark_hull(aes(x,y, group = group, fill = group),
concavity = 1,
expand = unit(8, "mm"),
colour = "#212121",
alpha = 0.25) +
geom_edge_fan(
lineend = 'round',
show.legend = FALSE,
edge_width = 0.5,
color = "white",
alpha = 0.1
) +
geom_node_label(aes(label = name, fill = group),
size = 4,
colour = 'white',
alpha = 0.7,
repel = TRUE
) +
rcartocolor::scale_colour_carto_d(palette = "Bold", guide = FALSE) +
labs(title = "Hollywood Network Clusters of Producer",
subtitle = "Producer that worked with the same composer",
caption = "Source: Ucinet Software") +
theme_graph() +
theme(
plot.background = element_rect(fill = "#212121"),
plot.title = element_text(color = "white", size = 40),
plot.subtitle = element_text(color = "white", size = 20),
plot.caption = element_text(color = "white", size = 12),
legend.position = "none",
) For the last part, we will leave the world of films behind and venture into the world of literary saga. And more particularly the saga Game of Thrones. I’ve read all the books, and I’d love to create a character overview highlighting the most important characters. Here we go!
Finding data on Game of Thrones was not particularly difficult. Luckily, I was even able to get clean data, requiring little preparation. That’s good, so I have more time for visualization and analysis. You can find the resources on Andrew Beveridge’s GitHub page who made his own network.
As mentioned before, importing the data doesn’t require many manipulations. You can click on the bottom right button if you would like to see the code.
Maybe just some explanations about the data. This network represents Characters interactions in the books of George R.R. Matin’s saga “A Song of Ice and Fire”.
Each node represent a character and each edge represents an interaction. The edge weight column corresponds to the number of interactions.
These networks were created by connecting two characters whenever their names (or nicknames) appeared within the 15 words of one another in one of the books.
# Import the data
got_edges <- read_csv("https://raw.githubusercontent.com/mathbeveridge/asoiaf/master/data/asoiaf-all-edges.csv")
got_nodes <- read_csv("https://raw.githubusercontent.com/mathbeveridge/asoiaf/master/data/asoiaf-all-nodes.csv")
# separate the Label into first name and last name for readability
got_nodes <- got_nodes %>%
separate(Label, into = c("first_name", "last_name"), sep = " ", extra = "merge", fill = "right")
# Transform data into one suitable "tbl_object"
got_graph <- tbl_graph(
nodes = got_nodes,
edges = got_edges,
directed = FALSE
) Who is the most important character? Jon Snow, Tyrion Lannister, Daenerys Targaryen or someone else? In fact, the answer might change according to how we plot the data. Science offers us many different metrics to measure the importance of a node in a network. There might be serveral answers. Let’s see.
To begin, I would like to start small and plot the 20 characters who had the most interactions in the books. For this, I use the “Weighted Degree Centrality” which is the sum of the weights of the edges incident with the node. This is the total number of interactions involving the character.
# Calculating weighted Degree centrality
got_graph <- got_graph %>%
activate(nodes) %>%
mutate(centrality = centrality_degree(weights = weight),
Id = row_number())
# Select the top 20 characters
top_20 <- got_graph %>%
activate(nodes) %>%
as_tibble() %>%
arrange(desc(centrality)) %>%
head(20) %>%
select(Id, first_name)
# Select the top 15 characters
top_15 <- got_graph %>%
activate(nodes) %>%
as_tibble() %>%
arrange(desc(centrality)) %>%
head(15) %>%
select(Id, first_name)
# highlight the top 15, 20 nodes and add clusters
got_graph <- got_graph %>%
activate(nodes) %>%
mutate(most_central_20 = as.factor(ifelse(first_name %in% top_20$first_name ,1,0)),
most_central_15 = as.factor(ifelse(first_name %in% top_15$first_name ,1,0)),
group= as.factor(group_fast_greedy()))
# Extract the groups created from group_louvain()
groups <- got_graph %>%
activate(nodes) %>%
as_tibble() %>%
select(Id, group)
# mutate the groups in the edges
got_graph <- got_graph %>%
activate(edges) %>%
inner_join(groups, by = c("to" = "Id")) %>%
activate(nodes)
# plot the graph centrality
ggraph(got_graph, layout = "centrality", cent = graph.strength(got_graph)) +
geom_edge_link0(aes(width = weight), edge_colour = "lightgrey", edge_alpha = 0.3) +
scale_edge_width(range = c(0.1, 10)) +
scale_edge_color_brewer(palette = "Set1") +
geom_node_point(aes(fill = group, size = centrality, alpha = most_central_20),
colour = "transparent",
shape = 21,
stroke = 0.3) +
geom_node_text(aes(label = ifelse(first_name %in% top_20$first_name, first_name, NA),
alpha = most_central_20,
size = centrality/5),
color = "black",
repel = TRUE) +
scale_fill_brewer(palette = "Set1", na.value = "gray53") +
scale_size(range = c(0.1, 30)) +
scale_alpha_discrete(range = c(0.5,1)) +
labs(title = "Characters centrality",
subtitle = "Top 20 Most Important Characters names of Game of Thrones based on number of interactions",
caption = "Source: Andrew Beveridge's data") +
theme_graph() +
theme(legend.position = "none",
plot.title = element_text(size = 40),
plot.subtitle = element_text( size = 20),
plot.caption = element_text( size = 12))From the number of interactions in the books, it seems that Tyrion is at the heart of the action, followed closely by Jon and Cersei. If you’ve seen the show, you might be surprised that Daenerys Targaryen doesn’t have a more central place. Indeed, she is a central character in the series. But don’t forget that the books are not yet finished, and that the character becomes more important at the end of the saga.
You certainly ask yourselves why I’ve chosen those colors? In fact, I used an algorithm, the Louvain’s method, that automatically create clusters. Intuitively speaking, this is a subset of the network that forms a self-contained and coherent sub-network. One common metric that people use for clusters detection is that there should be lots of edges within clusters, and fewer edges between clusters. This idea is captures by a quantity called modularity that I won’t define explicitly here (also because I don’t really know how it works). Given a network, we use standard techniques to split the network into communities so that modularity is maximized (approximately, anyway). One nice feature of this process is that we discover the number of communities, rather than picking the optimal number of communities up front.
Let’s use another layout (“The backbone layout”) in order to try to regroup those clusters.
# Plot the graph clusters
ggraph(got_graph, layout = "backbone") +
geom_edge_arc0(aes(width = weight, edge_colour = group), edge_alpha = 0.3) +
scale_edge_width(range = c(0.1, 2)) +
scale_edge_color_brewer(palette = "Set1") +
geom_node_point(aes(fill = group, size = centrality, alpha = most_central_20),
colour = "transparent",
shape = 21,
stroke = 0.3) +
geom_node_text(aes(label = ifelse(first_name %in% top_15$first_name, first_name, NA),
alpha = most_central_15,
size = centrality/5),
color = "black",
repel = TRUE) +
scale_fill_brewer(palette = "Set1", na.value = "gray53") +
scale_size(range = c(0.1, 15)) +
scale_alpha_discrete(range = c(0.5,1)) +
labs(title = "Characters clusters",
subtitle = "with the top 15 most important Character names of Game of Thrones grouped by location",
caption = "Source: Andrew Beveridge's data") +
theme_graph() +
theme(legend.position = "none",
plot.title = element_text( size = 40),
plot.subtitle = element_text(size = 20),
plot.caption = element_text(size = 12))Interesting isn’t it? Whithout saying anything, some clusters are automatically created. Here we can identify where the characters are located. There are many characters in King’s Landing fighting for the Iron Throne. There are the Stark (blue) who are in Winterfell, Jon and Samwell (green) who are at the Wall in Castle black and Daenerys who is in Essos. Note that it is only my interpretation. The algorithm just look at the inputs and provide some similarities based on what we want to demonstrate.
Finally, I wanted to know who are the most important Houses in Game of Thrones. In order to do this, I looked at their last names and tried to class them into the different Houses. Characters like Jon Snow are placed in Other Houses. Then I selected for each House the most important Character (with the most interactions) and ploted it on the map.
# adding the family names
got_graph <- got_graph %>%
mutate(house = case_when(str_detect(last_name, "Stark") ~ "House Stark",
str_detect(last_name, "Lannister") ~ "House Lannister",
str_detect(last_name, "Arryn") ~ "House Arryn",
str_detect(last_name, "Tyrell") ~ "House Tyrell",
str_detect(last_name, "Greyjoy") ~ "House Greyjoy",
str_detect(last_name, "Martell") ~ "House Martell",
str_detect(last_name, "Baratheon") ~ "House Baratheon",
str_detect(last_name, "Tully") ~ "House Tully",
str_detect(last_name, "Targaryen") ~ "House Targaryen",
TRUE ~ "Other Houses"))
# Select the most central character per family
got_graph <- got_graph %>%
activate(nodes) %>%
group_by(house) %>%
mutate(most_central_family = centrality == max(centrality)) %>%
ungroup()
# Extract house for each id as a tibble
group_family <- got_graph %>%
activate(nodes) %>%
as_tibble() %>%
select(Id, house)
# Mutate the houses in the edges
got_graph <- got_graph %>%
activate(edges) %>%
inner_join(group_family, by = c("to" = "Id")) %>%
activate(nodes)
# Add colors of families
got_graph <- got_graph %>%
activate(nodes) %>%
mutate(fill = case_when(
house %in% "House Stark" ~ "#A5A5A5",
house %in% "House Lannister" ~ "#610402",
house %in% "House Arryn" ~ "#1B3972",
house %in% "House Tyrell" ~ "#34540A",
house %in% "House Greyjoy" ~ "#0F0F0F",
house %in% "House Martell" ~ "#A06008",
house %in% "House Baratheon" ~ "#D0B607",
house %in% "House Tully" ~ "#072B77",
house %in% "House Targaryen" ~ "#171717",
TRUE ~ "lightgrey"
)) %>%
mutate(color= case_when(
house %in% "House Stark" ~ "black",
house %in% "House Lannister" ~ "#E1AC09",
house %in% "House Arryn" ~ "#C3C3C1",
house %in% "House Tyrell" ~ "#D7AD01",
house %in% "House Greyjoy" ~ "#EEC80C",
house %in% "House Martell" ~ "darkred",
house %in% "House Baratheon" ~ "#000000",
house %in% "House Tully" ~ "#C6C2C2",
house %in% "House Targaryen" ~ "#BA220E",
TRUE ~ "black"
))
# Set the color of the edges
color_nodes <- got_graph %>%
activate(nodes) %>%
as_tibble() %>%
select(Id, color)
# Mutate the groups in the edges
got_graph <- got_graph %>%
activate(edges) %>%
inner_join(color_nodes, by = c("to" = "Id")) %>%
activate(nodes)
# Plot the clusters with the houses colors
ggraph(got_graph, layout = "backbone") +
geom_edge_link0(aes(filter = house != "Other Houses",width = weight, edge_colour = color), edge_alpha = 0.2) +
scale_edge_color_identity() +
scale_edge_width(range = c(0.1, 2)) +
geom_node_point(aes(filter = house != "Other Houses", fill = fill,
size = centrality,
alpha = most_central_family,
colour = color),
shape = 21,
stroke = 1) +
geom_node_label(aes(filter = house != "Other Houses",
label = ifelse(most_central_family == 1, house, NA),
alpha = most_central_family,
size = centrality/5,
color = color,
fill = fill),
repel = TRUE
) +
scale_size(range = c(0.1, 15)) +
scale_alpha_discrete(range = c(0.8,1)) +
scale_fill_identity() +
scale_color_identity() +
labs(title = "Houses of Westeros",
caption = "Source: Andrew Beveridge's data") +
theme_graph() +
theme(legend.position = "none",
plot.title = element_text(size = 40),
plot.subtitle = element_text(size = 20),
plot.caption = element_text( size = 12))As expected the House Stark, Lannister and Baratheon (we are still in the middle of the saga, which is why Baratheon has it’s importance) are the most important Houses in Game of Thrones.
That’s it for the networks. Let’s see us in the next project !
A work by Valentin Monney